Hello World under Uncertainty!¶
In this notebook, we will use the FoSRL library to train a safe policy in an uncertain control-affine system.
We consider a linear dynamical system with additive uncertainty: $$ \dot{x} = A \, x + B \, u + z $$
where $x = [p_x, p_y]$ is the state, $u = [v_x, v_y]$ the control input, and $z = [z_x, z_y]$ the uncertainty.
Among many types of uncertainty, we demonstrate the use of FoSRL on additive bounded uncertainty.
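To make this concrete, here is a minimal numerical sketch of these dynamics for the 2D single-integrator instance used later in this notebook (assuming $A = 0$ and $B = I$, so that $\dot{x} = u + z$):
import numpy as np
# uncertain control-affine dynamics, single-integrator instance
# (assumption for illustration: A = 0, B = I)
A = np.zeros((2, 2))
B = np.eye(2)
def xdot(x, u, z):
    # x_dot = A x + B u + z
    return A @ x + B @ u + z
x = np.array([0.0, 0.0])   # position [px, py]
u = np.array([1.0, 0.5])   # velocity command [vx, vy]
z = np.array([0.1, -0.2])  # bounded disturbance [zx, zy]
print(xdot(x, u, z))       # -> [1.1 0.3]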
As already shown in the previous tutorial, FoSRL operates in two distinct phases:
- An offline counter-example guided pretraining phase, where FoSRL builds a Control Barrier Function (CBF) that keeps the system within the set of safe states.
- An online safe interactive learning phase, where FoSRL iteratively (and safely) explores the environment, collects the reward signal, and updates the policy to maximize reward collection.
The changes to cope with uncertainty mainly affect the first phase, while the second remains the same as in the previous tutorial.
# Import the necessary libraries
import numpy as np
import torch
from fosco.systems import make_system
from fosco.systems.uncertainty import add_uncertainty
from fosco.config import CegisConfig
from fosco.cegis import Cegis
# plotting utilities
from fosco.plotting.domains import plot_domain
from fosco.plotting.constants import DOMAIN_COLORS
from plotly.graph_objs import Figure
%matplotlib widget
seed = 42
verbosity = 1
Offline Counter-example guided Pretraining¶
Define the symbolic assumptions on the dynamics and uncertainty¶
To enable the counter-example guided pretraining, we need to define a set of symbolic assumptions on the environment dynamics that we use to learn and verify the CBF.
For all systems, the assumptions consist of the characterization of the state and input spaces, and of sub-domains for initial, unsafe, and other states. Moreover, since we are considering an uncertain system, we also need to define the uncertainty domain.
Each domain must be a symbolic set (see the sampling sketch after this list). We offer several implementations of common multi-dimensional sets, such as:
- Rectangle, Sphere,
- Union, Intersection, Complement,
- and others.
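As an intuition for how numerical data can be drawn from such sets, here is an illustrative sketch of uniform sampling from a rectangle and a sphere (not the fosco implementation; the actual sets also carry symbolic expressions for the verifier):
import torch
def sample_rectangle(lb, ub, n):
    # uniform samples in the axis-aligned box [lb, ub]
    lb, ub = torch.as_tensor(lb), torch.as_tensor(ub)
    return lb + (ub - lb) * torch.rand(n, lb.numel())
def sample_sphere(center, radius, n):
    # uniform samples in the ball: random direction, radius ~ r * U^(1/d)
    center = torch.as_tensor(center)
    d = center.numel()
    direction = torch.randn(n, d)
    direction = direction / direction.norm(dim=1, keepdim=True)
    r = radius * torch.rand(n, 1) ** (1.0 / d)
    return center + r * direction
print(sample_rectangle([-5.0, -5.0], [5.0, 5.0], 3))
print(sample_sphere([0.0, 0.0], 1.0, 3))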
# define control-affine dynamical system
system_id = "SingleIntegrator"
uncertainty_type = "AdditiveBounded"
system = make_system(system_id)()
system = add_uncertainty(uncertainty_type, system=system)
print(type(system))
<class 'fosco.systems.uncertainty.additive_bounded.AdditiveBounded'>
# define state domains and input domains
# and initial, unsafe, lie state domains
domains = system.domains
print("Domains: ", list(domains.keys()), "\n")
for k, dom in domains.items():
    print(f"{k}: {dom}")
# Visualization of state domains
fig = Figure()
for dname, domain in domains.items():
    if dname not in DOMAIN_COLORS:
        continue
    fig = plot_domain(domain, fig, color=DOMAIN_COLORS[dname], label=dname)
fig.update_traces(showlegend=True)
fig.show()
Domains: ['lie', 'input', 'init', 'unsafe', 'uncertainty']

lie: Rectangle((-5.0, -5.0), (5.0, 5.0))
input: Rectangle((-5.0, -5.0), (5.0, 5.0))
init: Complement(Rectangle((-4.0, -4.0), (4.0, 4.0)))
unsafe: Sphere((0.0, 0.0), 1.0)
uncertainty: Sphere((0.0, 0.0), 1.0)
For this system, there are now five domains:
- input: the domain of control actions;
- uncertainty: the domain of the uncertainty;
- init: the domain of initial states, for which we want the CBF to be positive;
- unsafe: the domain of unsafe states, for which we want the CBF to be negative;
- lie: the domain of all states, over which we enforce the CBF condition on the Lie derivative.
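As a quick sanity check of these semantics, we can draw samples via each domain's generate_data method (the same method used by the data generators below) and verify that they match the printed set definitions:
# sanity check: init samples lie outside the central 4x4 box,
# unsafe samples lie inside the unit sphere
x_init = domains["init"].generate_data(1000)
x_unsafe = domains["unsafe"].generate_data(1000)
assert bool((x_init.abs().max(dim=1).values >= 4.0).all())
assert bool((x_unsafe.norm(dim=1) <= 1.0).all())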
Define the numerical data¶
Having defined the symbolic expressions for the verification of the CBF, we now need their numerical counterparts to provide training data for learning.
Here, for each of the state domains, we are going to define a dataset of samples:
- init: a dataset of initial states;
- unsafe: a dataset of unsafe states;
- lie: a dataset of (state, action, uncertainty) tuples;
- uncertainty: a dataset of (state, action, uncertainty) tuples used to enforce the robustness condition.
Note: we do not define the training data directly, because the tool expects a generator function for each dataset.
# data generator
from fosco.common.consts import DomainName as dn
data_gen = {
    'init': lambda n: domains[dn.XI.value].generate_data(n),
    'unsafe': lambda n: domains[dn.XU.value].generate_data(n),
    'lie': lambda n: torch.concatenate(
        [
            domains["lie"].generate_data(n),
            domains["input"].generate_data(n),
            domains["uncertainty"].generate_data(n),
        ],
        dim=1,
    ),
    'uncertainty': lambda n: torch.concatenate(
        [
            domains["lie"].generate_data(n),
            domains["input"].generate_data(n),
            domains["uncertainty"].generate_data(n),
        ],
        dim=1,
    ),
}
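Each generator simply maps a sample count to a fresh batch of data, which lets the CEGIS loop resample at every iteration. For instance:
x_init = data_gen["init"](5)  # 5 initial states, shape (5, 2)
x_lie = data_gen["lie"](5)    # 5 (state, action, uncertainty) samples, shape (5, 6)
print(x_init.shape, x_lie.shape)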
Define the configuration¶
It remains to define the configuration of the pretraining and its hyper-parameters.
There are two important changes to the configuration:
- the CERTIFICATE is now set to rcbf, to indicate that we are looking for a Robust CBF;
- the LOSS_WEIGHTS now include the uncertainty loss term and two additional regularization terms.
config = CegisConfig(
    SEED=seed,              # the seed for reproducibility
    CERTIFICATE="rcbf",     # the type of certificate, either cbf or rcbf
    VERIFIER="z3",          # the type of verifier, either z3 or dreal
    ACTIVATION=["htanh"],   # the activation of the i-th hidden layer
    N_HIDDEN_NEURONS=[20],  # the nr of neurons of the i-th hidden layer
    CEGIS_MAX_ITERS=20,     # the maximum number of iterations
    N_DATA=5000,            # the nr of samples in each training dataset
    RESAMPLING_N=100,       # the nr of points to sample around each counter-example
    RESAMPLING_STDDEV=0.1,  # the std deviation to sample around each counter-example
    LOSS_WEIGHTS={          # the weights for each loss term
        'init': 1.0,
        'unsafe': 1.0,
        'lie': 1.0,
        'robust': 1.0,
        'conservative_b': 1.0,
        'conservative_sigma': 0.1,
    },
)
Let us spend a few words on the loss weights.
To find a candidate CBF, we minimize the following loss:
$$ \mathcal{L} = \lambda_{init} \, \mathcal{L}_{init} + \lambda_{unsafe} \, \mathcal{L}_{unsafe} + \lambda_{lie} \, \mathcal{L}_{lie} + \lambda_{robust} \, \mathcal{L}_{robust} + \lambda_{reg-b} \, \mathcal{L}_{reg-b} + \lambda_{reg-s} \, \mathcal{L}_{reg-s} $$
where:
- $\mathcal{L}_{init}$ penalizes counter-examples in the dataset of initial states;
- $\mathcal{L}_{unsafe}$ penalizes counter-examples in the dataset of unsafe states;
- $\mathcal{L}_{lie}$ penalizes counter-examples in the lie dataset;
- $\mathcal{L}_{robust}$ penalizes counter-examples in the robust dataset;
- $\mathcal{L}_{reg-b}$ penalizes states where the CBF is negative, to discourage overly conservative CBFs;
- $\mathcal{L}_{reg-s}$ penalizes states where the compensator is positive, to discourage overly conservative compensators.
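To make the first two terms concrete, here is a schematic of how they can be computed as hinge penalties on a candidate barrier (an illustration of the loss structure, not FoSRL's exact implementation):
import torch
def boundary_losses(h, x_init, x_unsafe, margin=0.0):
    # we want h > 0 on init states: penalize non-positive values
    loss_init = torch.relu(margin - h(x_init)).mean()
    # we want h < 0 on unsafe states: penalize non-negative values
    loss_unsafe = torch.relu(margin + h(x_unsafe)).mean()
    return loss_init, loss_unsafe
The lie and robust terms penalize violations of the (robust) Lie-derivative condition in the same hinge fashion, while the two regularizers act on the signs of the barrier and compensator as described above.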
from fosco.plotting.functions import plot_torch_function
cegis = Cegis(
    system=system,
    domains=domains,
    config=config,
    data_gen=data_gen,
    verbose=verbosity,
)
result = cegis.solve()
INFO:fosco.cegis:Seed: 42
INFO:fosco.cegis:Iteration 1
INFO:fosco.verifier.verifier:init: Counterexample Found: [x0, x1] = tensor([-4.6153, 4.4458]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5101, unsafe: 5000, lie: 5000, uncertainty: 5000
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 2
INFO:fosco.verifier.verifier:init: Counterexample Found: [x0, x1] = tensor([-4.3367, 4.8890]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5000
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 3
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-2.1406, -0.7266, -4.0000, -4.0000, 0.4297, 0.9026]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5101
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 4
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-0.8906, 1.2500, -4.0000, 3.0000, 0.0000, -1.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5202
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 5
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-2.0000, 0.5000, -4.0000, -4.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5303
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 6
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1., -1., -4., -4., 1., 0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5404
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 7
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1., 1., -4., -2., 1., 0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5505
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 8
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1., -1., -4., -4., 1., 0.]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5606
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 9
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-0.5000, -1.3750, -4.0000, -4.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5707
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 10
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000, 1.1250, -4.0000, -3.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5808
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 11
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000, 1.1250, -4.0000, -3.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 5909
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 12
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000, 1.1250, -4.0000, -3.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6010
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 13
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.0000, 1.1250, -4.0000, -3.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6111
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 14
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5000, 0.0000, -4.0000, -4.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6212
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 15
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5000, 0.0000, -4.0000, -4.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6313
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 16
INFO:fosco.verifier.verifier:uncertainty: Counterexample Found: [x0, x1, u0, u1, z0, z1] = tensor([-1.5625, 0.0000, -4.0000, -4.0000, 1.0000, 0.0000]), [] = tensor([])
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6414
INFO:fosco.cegis:
INFO:fosco.cegis:Iteration 17
INFO:fosco.verifier.verifier:No counterexamples found!
INFO:fosco.consolidator.consolidator:Dataset sizes: init: 5202, unsafe: 5000, lie: 5000, uncertainty: 6414
INFO:fosco.cegis:CEG Pretraining finished after 17 iterations
import plotly.graph_objects as go
fig = go.Figure(layout=dict(width=1000, height=1000))
fig = plot_torch_function(
    function=result.barrier,
    domains=system.domains,
    fig=fig,
)
fig.show()
fig = go.Figure(layout=dict(width=1000, height=1000))
fig = plot_torch_function(
    function=result.compensator,
    domains=system.domains,
    fig=fig,
)
fig.show()
Online Safe Interactive Learning¶
Starting from the Robust CBF found in the previous phase, we can now proceed with the policy training.
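Concretely, the barrier $h$ and compensator $\sigma$ are used online to filter the policy's actions: in the standard robust-CBF formulation, an action $u$ is deemed safe at state $x$ if $\nabla h(x) \cdot (f(x) + g(x) \, u) - \sigma(x) + \alpha \, h(x) \geq 0$, where $\sigma$ absorbs the worst-case effect of the uncertainty $z$. A sketch of this check (the exact filtering inside the trainer may differ):
import torch
def robust_cbf_margin(x, u, f, g, h, sigma, alpha=1.0):
    # margin of the robust CBF condition; non-negative means u is deemed safe
    x = x.detach().requires_grad_(True)
    hx = h(x).squeeze(-1)                         # barrier values, shape (n,)
    grad_h = torch.autograd.grad(hx.sum(), x)[0]  # dh/dx, shape (n, dim)
    xdot = f(x) + torch.bmm(g(x), u.unsqueeze(-1)).squeeze(-1)  # nominal dynamics
    lie = (grad_h * xdot).sum(dim=1)              # Lie derivative of h along xdot
    return lie - sigma(x).squeeze(-1) + alpha * hx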
Gymnasium Wrapper¶
As is common in RL libraries, we adopt the Gymnasium API to simulate the system.
To make the continuous-time CBF formulation work in the discretized simulation, we use a small time step.
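For intuition, the environment essentially performs explicit-Euler integration of the uncertain dynamics (a sketch; the integrator inside SystemEnv may differ):
import numpy as np
def euler_step(x, u, z, dt):
    # single integrator: x_{k+1} = x_k + dt * (u_k + z_k)
    return x + dt * (u + z)
x = np.zeros(2)
z = np.array([0.05, -0.02])  # hypothetical disturbance draw
print(euler_step(x, np.array([1.0, 0.0]), z, dt=0.1))  # -> [0.105 -0.002]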
import gymnasium as gym
from fosco.systems.gym_env.system_env import SystemEnv
from fosco.systems.gym_env.rewards import GoToUnsafeReward
from rl_trainer.wrappers.record_episode_statistics import RecordEpisodeStatistics
max_steps = 100
sim_dt = 0.1
num_envs = 3
def make_env(seed, render_mode=None):
    def thunk():
        env = SystemEnv(
            system=system,
            dt=sim_dt,
            max_steps=max_steps,
            reward_fn=GoToUnsafeReward(system=system),
            render_mode=render_mode,
        )
        env.action_space.seed(seed)
        env = RecordEpisodeStatistics(env)
        env = gym.wrappers.NormalizeReward(env)
        env = gym.wrappers.TransformReward(env, lambda reward: np.clip(reward, -10, 10))
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env(seed=seed) for _ in range(num_envs)])
Define Safe Policy¶
from rl_trainer.ppo.ppo_config import PPOConfig
config = PPOConfig()
config.num_envs = num_envs
config.num_steps = 2048
config.total_timesteps = 200000
from rl_trainer.safe_ppo.safeppo_trainer import SafePPOTrainer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
trainer = SafePPOTrainer(
    envs=envs,
    config=config,
    barrier=result.barrier,
    compensator=result.compensator,
    device=device,
)
Training¶
results = trainer.train(envs=envs, verbose=verbosity)
INFO:rl_trainer.ppo.ppo_trainer:iteration 1/32
INFO:rl_trainer.ppo.ppo_trainer:SPS: 183
INFO:rl_trainer.ppo.ppo_trainer:iteration 2/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 6144/200000 episodic returns: -532.93 +/- 97.55 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 176
INFO:rl_trainer.ppo.ppo_trainer:iteration 3/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 12288/200000 episodic returns: -490.32 +/- 112.82 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 174
INFO:rl_trainer.ppo.ppo_trainer:iteration 4/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 18432/200000 episodic returns: -487.87 +/- 64.57 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 163
INFO:rl_trainer.ppo.ppo_trainer:iteration 5/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 24576/200000 episodic returns: -424.07 +/- 86.58 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 163
INFO:rl_trainer.ppo.ppo_trainer:iteration 6/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 30720/200000 episodic returns: -438.14 +/- 97.04 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 159
INFO:rl_trainer.ppo.ppo_trainer:iteration 7/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 36864/200000 episodic returns: -380.73 +/- 58.51 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 157
INFO:rl_trainer.ppo.ppo_trainer:iteration 8/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 43008/200000 episodic returns: -361.75 +/- 54.25 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 154
INFO:rl_trainer.ppo.ppo_trainer:iteration 9/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 49152/200000 episodic returns: -370.12 +/- 73.51 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 152
INFO:rl_trainer.ppo.ppo_trainer:iteration 10/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 55296/200000 episodic returns: -314.30 +/- 41.00 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 150
INFO:rl_trainer.ppo.ppo_trainer:iteration 11/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 61440/200000 episodic returns: -293.47 +/- 35.19 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 148
INFO:rl_trainer.ppo.ppo_trainer:iteration 12/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 67584/200000 episodic returns: -267.14 +/- 32.23 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 149
INFO:rl_trainer.ppo.ppo_trainer:iteration 13/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 73728/200000 episodic returns: -271.40 +/- 28.44 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 14/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 79872/200000 episodic returns: -263.61 +/- 23.42 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 15/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 86016/200000 episodic returns: -251.17 +/- 16.95 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 148
INFO:rl_trainer.ppo.ppo_trainer:iteration 16/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 92160/200000 episodic returns: -256.43 +/- 20.48 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 146
INFO:rl_trainer.ppo.ppo_trainer:iteration 17/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 98304/200000 episodic returns: -250.46 +/- 17.90 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 147
INFO:rl_trainer.ppo.ppo_trainer:iteration 18/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 104448/200000 episodic returns: -243.88 +/- 23.26 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 145
INFO:rl_trainer.ppo.ppo_trainer:iteration 19/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 110592/200000 episodic returns: -234.85 +/- 15.38 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 146
INFO:rl_trainer.ppo.ppo_trainer:iteration 20/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 116736/200000 episodic returns: -231.38 +/- 19.23 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 145
INFO:rl_trainer.ppo.ppo_trainer:iteration 21/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 122880/200000 episodic returns: -231.78 +/- 11.90 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 144
INFO:rl_trainer.ppo.ppo_trainer:iteration 22/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 129024/200000 episodic returns: -238.40 +/- 17.69 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 141
INFO:rl_trainer.ppo.ppo_trainer:iteration 23/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 135168/200000 episodic returns: -233.04 +/- 14.28 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 24/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 141312/200000 episodic returns: -227.16 +/- 13.14 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 25/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 147456/200000 episodic returns: -236.40 +/- 12.52 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 26/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 153600/200000 episodic returns: -232.63 +/- 16.54 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 27/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 159744/200000 episodic returns: -222.71 +/- 12.68 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 28/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 165888/200000 episodic returns: -228.34 +/- 17.19 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 142
INFO:rl_trainer.ppo.ppo_trainer:iteration 29/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 172032/200000 episodic returns: -226.56 +/- 9.05 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 30/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 178176/200000 episodic returns: -232.41 +/- 14.47 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 31/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 184320/200000 episodic returns: -228.50 +/- 10.80 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
INFO:rl_trainer.ppo.ppo_trainer:iteration 32/32
INFO:rl_trainer.ppo.ppo_trainer:Last 10 episodes: global step: 190464/200000 episodic returns: -221.86 +/- 13.60 episodic costs: 0.00 +/- 0.00 episodic lengths: 100.00 +/- 0.00
INFO:rl_trainer.ppo.ppo_trainer:SPS: 143
# process training metrics
train_steps = results["train_steps"]
train_returns = results["train_returns"]
train_costs = results["train_costs"]
# group by steps (possible duplicate in steps due to vectorization)
train_steps, train_returns, train_returns_std, train_costs, train_costs_std = zip(*sorted(
    [
        (
            step,
            np.mean([r for r, s in zip(train_returns, train_steps) if s == step]),
            np.std([r for r, s in zip(train_returns, train_steps) if s == step]),
            np.mean([c for c, s in zip(train_costs, train_steps) if s == step]),
            np.std([c for c, s in zip(train_costs, train_steps) if s == step]),
        )
        for step in set(train_steps)
    ]
))
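The comprehension above rescans the full lists once per unique step; an equivalent single-pass grouping with a dictionary (a sketch producing the same aggregates) looks like this:
from collections import defaultdict
grouped = defaultdict(lambda: ([], []))  # step -> (returns, costs)
for s, r, c in zip(results["train_steps"], results["train_returns"], results["train_costs"]):
    grouped[s][0].append(r)
    grouped[s][1].append(c)
steps = sorted(grouped)
returns_mean = [np.mean(grouped[s][0]) for s in steps]
returns_std = [np.std(grouped[s][0]) for s in steps]
costs_mean = [np.mean(grouped[s][1]) for s in steps]
costs_std = [np.std(grouped[s][1]) for s in steps]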
# load baseline data for comparison
baseline_data = np.genfromtxt("data/results_nocbf.csv", delimiter=",", comments="#")[1:]
baseline_steps = baseline_data[:, 0]
baseline_returns = baseline_data[:, 1]
baseline_costs = baseline_data[:, 2]
# group by steps (possible duplicate in steps due to vectorization)
baseline_steps, baseline_returns, baseline_returns_std, baseline_costs, baseline_costs_std = zip(*sorted(
    [
        (
            step,
            np.mean([r for r, s in zip(baseline_returns, baseline_steps) if s == step]),
            np.std([r for r, s in zip(baseline_returns, baseline_steps) if s == step]),
            np.mean([c for c, s in zip(baseline_costs, baseline_steps) if s == step]),
            np.std([c for c, s in zip(baseline_costs, baseline_steps) if s == step]),
        )
        for step in set(baseline_steps)  # group by the baseline's own steps
    ]
))
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
# label both curves so the legends distinguish SafePPO from the PPO baseline
axes[0].plot(train_steps, train_returns, label="SafePPO")
axes[0].fill_between(
    train_steps,
    np.array(train_returns) - np.array(train_returns_std),
    np.array(train_returns) + np.array(train_returns_std),
    alpha=0.2,
)
axes[0].plot(baseline_steps, baseline_returns, label="PPO")
axes[0].fill_between(
    baseline_steps,
    np.array(baseline_returns) - np.array(baseline_returns_std),
    np.array(baseline_returns) + np.array(baseline_returns_std),
    alpha=0.2,
)
axes[0].set_xlabel("Steps")
axes[0].set_title("Episodic Return")
axes[0].legend()
axes[1].plot(train_steps, train_costs, label="SafePPO")
axes[1].fill_between(
    train_steps,
    np.array(train_costs) - np.array(train_costs_std),
    np.array(train_costs) + np.array(train_costs_std),
    alpha=0.2,
)
axes[1].plot(baseline_steps, baseline_costs, label="PPO")
axes[1].fill_between(
    baseline_steps,
    np.array(baseline_costs) - np.array(baseline_costs_std),
    np.array(baseline_costs) + np.array(baseline_costs_std),
    alpha=0.2,
)
axes[1].set_xlabel("Steps")
axes[1].set_title("Episodic Cost")
axes[1].legend()
plt.show()
eval_envs = gym.vector.SyncVectorEnv([make_env(seed=seed, render_mode='rgb_array') for i in range(1)])
agent = trainer.get_actor()
obs, infos = eval_envs.reset()
done = False
frames = []
while not done:
    action = agent.get_action_and_value(torch.Tensor(obs).to(device))["action"]
    obs, reward, term, trunc, infos = eval_envs.step(action.detach().cpu().numpy())
    done = term[0] or trunc[0]
    frame = eval_envs.envs[0].render()
    frames.append(frame)
# save the rollout as a video
import imageio
imageio.mimsave("safe_robust_policy.mp4", frames, fps=10)
from IPython.display import Video
Video("safe_robust_policy.mp4")
import time
t0 = time.time()
batch_size = 100
print(f"Seed {seed}")
env = SystemEnv(
    system=system,
    dt=sim_dt,
    max_steps=max_steps,
)
obs, info = env.reset(
    seed=seed, options={"batch_size": batch_size, "return_as_np": False}
)
terminations = truncations = np.zeros(batch_size, dtype=bool)
traj = {"x": [obs], "u": []}
while not (any(terminations) or any(truncations)):
    obs = obs[None] if len(obs.shape) == 1 else obs
    u = agent.get_action_and_value(torch.Tensor(obs).to(device))["action"]
    u = u.detach().cpu().numpy()  # move actions back to cpu before stepping the env
    obs, rewards, terminations, truncations, infos = env.step(u)
    traj["x"].append(obs)
    traj["u"].append(u)
print(f"Sim time: {time.time() - t0} seconds")
Seed 42 Sim time: 18.42903470993042 seconds
traj["x"] = np.array(traj["x"])
traj["u"] = np.array(traj["u"])
fig, ax1 = plt.subplots(1, 1, figsize=(10, 10))
for i in range(traj["x"].shape[1]):
    xs = traj["x"][:, i, 0]
    ys = traj["x"][:, i, 1]
    ax1.plot(xs, ys, color="blue")
    ax1.scatter(xs[0], ys[0], marker="x", color="k")
# draw the circular unsafe set
cx, r = system.domains["unsafe"].center, system.domains["unsafe"].radius
ax1.plot(
    cx[0] + r * np.cos(np.linspace(0, 2 * np.pi, 25)),
    cx[1] + r * np.sin(np.linspace(0, 2 * np.pi, 25)),
    color="r",
    linestyle="dashed",
    label="obstacle",
)
ax1.set_title("Space Trajectories")
ax1.set_xlabel("x[0]")
ax1.set_ylabel("x[1]")
ax1.set_xlim(-5, +5)
ax1.set_ylim(-5, +5)
ax1.axis("equal")
ax1.invert_yaxis()
ax1.legend()
<matplotlib.legend.Legend at 0x7f160c212e00>